graphical representation of high dimensional data in lower dimensional spaces to visually
examine the scatter of the data, and detection of outliers.

The  computations  involved  in  these  methods  and  the  interpretation  of  results  in  different 
situations  are  discussed.  The  difference  between  PCA  and  FA,  and  the  need  to  choose  the 
appropriate  technique  in  the  analysis  of  given  data  are  stressed. 

It  is  shown  that  there  is  a  close  similarity  between  the  growth  curve  models  used  in 
biometric  studies  and  the  arbitrage  pricing  theory  model  recently  introduced  in  financial 
statistics.

1. Introduction


Principal component and factor analyses (PCA and FA) are exploratory multivariate techniques
used in studying the covariance (or correlation) structure of measurements made on
individuals. The objectives range from reduction of high dimensional data by finding a few
latent variables which explain the variations of, or the associations between, the observable
measurements, grouping of similar measurements and detection of multicollinearity, to
graphical representation of high dimensional data in lower dimensional spaces to visually
examine the scatter of the data, and detection of outliers. PCA was developed by Pearson (1901)
and Hotelling (1933); a general theory with some extensions and applications is given in
Rao (1964). FA originated with the work of Spearman (1904) and was developed by Lawley
(1940) under the assumption of multivariate normality. A general theory of FA, under the
title Canonical Factor Analysis (CFA), without any distributional assumptions was given
by Rao (1955). There are now a number of excellent full length monographs devoted to the
computational aspects and uses of PCA and FA in social and physical scientific research.
Reference may be made to Bartholomew (1987), Basilevsky (1994), Cattell (1978), Jackson
(1991) and Jolliffe (1986), to mention a few authors.

A technique related to PCA, when the measurements are qualitative, is correspondence
analysis (CA), developed by Benzecri (1973) based on a method of scaling qualitative
categories suggested by Fisher (1936). A monograph by Greenacre (1984) gives the theory and
applications of CA in the analysis of contingency tables. A recent paper by Rao (1995)
contains an alternative to CA which serves the same purpose and seems to have some
advantages over the earlier approach.

In  this  paper  a  general  survey  is  given  of  PCA  and  FA  with  some  recent  theoretical  results 
and  practical  applications. 


2.  Principal  Components 

2.1  The  general  problem 

The problem of principal components can be stated in a very general setup as follows.
Let $x$ be a $p$-vector variable and $y$ be a $q$-vector variable, where some components of $x$ and
$y$ may be the same. We want to replace $y$ by $z = Ay$, where $A$ is an $r \times q$ matrix and $r < q$,
in such a way that the loss in predicting $x$ by using $z$ instead of $y$ is as small as possible. If

$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \tag{2.1}$$

is the covariance matrix of $x$ and $y$, then the covariance matrix of the errors in predicting
$x$ by $z = Ay$ is

$$W = \Sigma_{11} - \Sigma_{12}A'(A\Sigma_{22}A')^{-1}A\Sigma_{21}. \tag{2.2}$$

We choose $A$ such that $\|W\|$, for a suitably chosen norm, is small. If we choose $\|W\| = \operatorname{tr} W$,
then the optimum choice is

$$A_* = \arg\max_A \operatorname{tr} \Sigma_{12}A'(A\Sigma_{22}A')^{-1}A\Sigma_{21}.$$

The maximum is attained at

$$A_*' = (C_1 : \cdots : C_r) \tag{2.3}$$

where $C_1, \ldots, C_r$ are the $r$ eigen vectors associated with the first $r$ eigen values
$\lambda_1^2 \ge \lambda_2^2 \ge \cdots \ge \lambda_q^2$ of $\Sigma_{21}\Sigma_{12}$ with respect to $\Sigma_{22}$, i.e., the eigen vectors and
values are those arising out of the determinantal equation

$$|\Sigma_{21}\Sigma_{12} - \lambda^2\Sigma_{22}| = 0. \tag{2.4}$$

The relative loss of information in using $z_* = A_*y$ for predicting $x$ is

$$\frac{\operatorname{tr}\left(\Sigma_{11} - \Sigma_{12}A_*'(A_*\Sigma_{22}A_*')^{-1}A_*\Sigma_{21}\right)}{\operatorname{tr}\Sigma_{11}} = 1 - \frac{\lambda_1^2 + \cdots + \lambda_r^2}{\operatorname{tr}\Sigma_{11}}. \tag{2.5}$$

We consider some special choices of $x$ and $y$ and derive the optimal transformation $A$ as
characterized in (2.3).
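The computation in (2.3)-(2.5) can be sketched numerically as follows. This is a minimal illustration, not part of the original paper: the function names `optimal_reduction` and `relative_loss`, the simulated data and the use of a Cholesky factor of $\Sigma_{22}$ to reduce (2.4) to an ordinary symmetric eigenproblem are all assumptions made here for the example.

```python
import numpy as np

def optimal_reduction(S12, S22, r):
    """Return (A_star, lam2): the r x q matrix A_* of (2.3) and the q
    eigenvalues lam2 of S21 S12 with respect to S22, in decreasing order."""
    S21 = S12.T
    # Whiten with the Cholesky factor S22 = G G', so the generalized problem
    # |S21 S12 - lam2 S22| = 0 becomes an ordinary symmetric eigenproblem.
    G = np.linalg.cholesky(S22)
    Ginv = np.linalg.inv(G)
    M = Ginv @ S21 @ S12 @ Ginv.T
    lam2, U = np.linalg.eigh(M)              # ascending order
    lam2, U = lam2[::-1], U[:, ::-1]         # switch to decreasing order
    C = Ginv.T @ U                           # generalized eigenvectors, C' S22 C = I
    return C[:, :r].T, lam2

def relative_loss(S11, S12, S22, A):
    """tr W / tr S11 with W as in (2.2)."""
    W = S11 - S12 @ A.T @ np.linalg.inv(A @ S22 @ A.T) @ A @ S12.T
    return np.trace(W) / np.trace(S11)

# Simulated (hypothetical) data: x has p components, y has q components.
rng = np.random.default_rng(0)
p, q, r = 3, 4, 2
Z = rng.standard_normal((200, p + q))
Z[:, :p] += Z[:, p:p + 3]                    # make x and y correlated
S = np.cov(Z, rowvar=False)
S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]

A_star, lam2 = optimal_reduction(S12, S22, r)
loss = relative_loss(S11, S12, S22, A_star)
```

Under these assumptions `loss` agrees with the right-hand side of (2.5), and any other $r \times q$ matrix gives a loss at least as large.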


2.2  The  choice  x  =  y 


The special choice, $x = y$, leads to the usual principal components $C_1'x, \ldots, C_r'x$, where
$C_1, \ldots, C_r$ are the eigen vectors associated with the first $r$ eigen values $\lambda_1^2 \ge \cdots \ge \lambda_r^2$ of
the determinantal equation $|\Sigma_{11} - \lambda^2 I| = 0$. In such a case, the loss of information (2.5) is

$$1 - \frac{\lambda_1^2 + \cdots + \lambda_r^2}{\lambda_1^2 + \cdots + \lambda_p^2} = \frac{\lambda_{r+1}^2 + \cdots + \lambda_p^2}{\lambda_1^2 + \cdots + \lambda_p^2} \tag{2.6}$$

usually expressed as a percentage. The choice of $r$ is determined by the magnitude of (2.6).

In practice, we have to estimate $\lambda_i^2$ and $C_i$ from a sample of $n$ independent observations
on the $p$-vector random variable $x$, which we denote by the $p \times n$ matrix

$$X = (x_1 : \cdots : x_n). \tag{2.7}$$

An estimate of $\Sigma_{11}$ is

$$S = (n-1)^{-1} X\left(I - \frac{1}{n}ee'\right)X'$$

where $e$ is an $n$-vector of unities. The estimates $\ell_i^2$ of $\lambda_i^2$ and $c_i$ of $C_i$ are obtained from the
spectral decomposition

$$S = \ell_1^2 c_1c_1' + \cdots + \ell_p^2 c_pc_p'. \tag{2.8}$$

The principal components of the observations on the $i$-th individual are then

$$q_i = (c_1'x_i, \ldots, c_p'x_i)'. \tag{2.9}$$

In the sequel, we denote

$s_{ii}$ = the $i$-th diagonal element of $S$,

$$c_j = (c_{j1}, \ldots, c_{jp})', \quad j = 1, \ldots, p, \tag{2.10.1}$$

$$\tilde c_{ji} = \ell_j c_{ji}, \quad i = 1, \ldots, p, \tag{2.10.2}$$

$$q_i = (q_{i1}, \ldots, q_{ip})', \quad i = 1, \ldots, n, \tag{2.11.1}$$

$$\tilde q_{ij} = \ell_j^{-1} q_{ij}, \quad i = 1, \ldots, n. \tag{2.11.2}$$

It may be noted that the vectors $c_i$ and $q_i$ (apart from a translation of coordinates) can be
obtained in one step from the singular value decomposition (SVD)

$$X\left(I - \frac{1}{n}ee'\right) = \ell_1 c_1q_1' + \cdots + \ell_p c_pq_p'. \tag{2.12}$$
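The two routes to the principal components, the spectral decomposition (2.8) and the SVD (2.12), can be compared on simulated data. The sketch below is illustrative only; the data and variable names are assumptions. Because $S$ carries the divisor $n - 1$, the singular values of the centred data matrix equal $\sqrt{n-1}$ times the $\ell_i$ computed from $S$ in this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 50
# Hypothetical data: p variables (rows) with clearly different spreads.
X = rng.standard_normal((p, n)) * np.array([[3.0], [2.0], [1.0], [0.5]])

e = np.ones((n, 1))
H = np.eye(n) - e @ e.T / n            # centering matrix I - n^{-1} e e'
S = X @ H @ X.T / (n - 1)              # sample covariance matrix

# Route 1: spectral decomposition of S, as in (2.8).
l2, C = np.linalg.eigh(S)
l2, C = l2[::-1], C[:, ::-1]           # decreasing eigenvalues
Q = (C.T @ X @ H).T                    # row i holds the centred PCs of individual i

# Route 2: SVD of the centred data matrix, as in (2.12).
U, sv, Vt = np.linalg.svd(X @ H, full_matrices=False)
l2_svd = sv ** 2 / (n - 1)             # squared singular values, rescaled
Q_svd = Vt.T * sv                      # n x p matrix of scores from the SVD
```

The eigenvalues agree exactly and the scores agree up to the sign of each component, which is arbitrary in both decompositions.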


2.3  Interpretation  of  principal  components 

For  an  interpretation  of  principal  components  in  terms  of  the  influence  of  the  original 
measurements  on  them,  we  need  the  following  computations  as  exhibited  in  Table  1. 

The magnitudes of the correlations in Table 1 indicate how well each variable is represented
in each PC and overall in the first $r$ PC's (judged by the values of $R_i^2$). The values of $R_i^2$
computed for $r = 1, 2, \ldots$ enable us to decide on $r$, the number of PC's to be chosen. If for
some $r$, the values of $R_i^2$ are high except for one value of $i$, say $j$, then we may decide to
include $x_j$ along with $z_1, \ldots, z_r$ or add other PC's where $x_j$ is well represented.
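The entries of Table 1 can be computed directly from the spectral decomposition of $S$. The following is a small sketch under assumptions: the function name `pc_correlation_table` and the simulated data are hypothetical, and the correlation of variable $i$ with the $j$-th PC is taken to be $\ell_j c_{ji}/\sqrt{s_{ii}}$, which follows from $\operatorname{cov}(x_i, c_j'x) = \ell_j^2 c_{ji}$.

```python
import numpy as np

def pc_correlation_table(X, r):
    """X is p x n. Returns (corr, R2): corr[i, j] is the correlation of
    variable i with PC j, and R2[i] is the squared multiple correlation
    of variable i on the first r PCs."""
    S = np.cov(X)                          # p x p, divisor n - 1
    l2, C = np.linalg.eigh(S)
    l2, C = l2[::-1], C[:, ::-1]           # decreasing eigenvalues
    s = np.sqrt(np.diag(S))
    corr = C * np.sqrt(l2) / s[:, None]    # l_j c_ji / sqrt(s_ii)
    R2 = (corr[:, :r] ** 2).sum(axis=1)
    return corr, R2

rng = np.random.default_rng(9)
X = rng.standard_normal((4, 100))          # hypothetical data, 4 variables
X[1] += 2.0 * X[0]                         # make two variables strongly related
corr, R2 = pc_correlation_table(X, r=2)
```

With $r = p$ every $R_i^2$ equals one, so the gap between $R_i^2$ and one shows how much of variable $i$ is lost by stopping at $r$ components.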

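Table 1

original    correlation with principal component                              multiple correlation
variable    $z_1$                          ...   $z_p$                          of $x_i$ on $z_1, \ldots, z_r$

$x_1$       $\tilde c_{11}/\sqrt{s_{11}}$   ...   $\tilde c_{p1}/\sqrt{s_{11}}$   $\sum_{j=1}^{r} \tilde c_{j1}^2/s_{11} = R_1^2$
...         ...                            ...   ...                            ...
$x_p$       $\tilde c_{1p}/\sqrt{s_{pp}}$   ...   $\tilde c_{pp}/\sqrt{s_{pp}}$   $\sum_{j=1}^{r} \tilde c_{jp}^2/s_{pp} = R_p^2$

2.4 Graphical display of data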


To represent the individuals in terms of the original measurements we need a $p$-dimensional
space. But for visual examination, we need a plot of the individuals in a two or a three
dimensional space, which reflects the configuration of the individuals in the $p$-space (distances
between individuals) to the extent possible. For this purpose, we use the PC's either as in
(2.11.1) or in the standardized form [SPC as in (2.11.2)]. The full set of new coordinates in
different dimensions, from which the first few may be selected, is displayed in Table 2.

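Table 2

individuals   dim 1                          dim 2                          ...   dim p
              PC         SPC                 PC         SPC                       PC         SPC

1             $q_{11}$   $\tilde q_{11}$     $q_{12}$   $\tilde q_{12}$     ...   $q_{1p}$   $\tilde q_{1p}$
2             $q_{21}$   $\tilde q_{21}$     $q_{22}$   $\tilde q_{22}$     ...   $q_{2p}$   $\tilde q_{2p}$
...
n             $q_{n1}$   $\tilde q_{n1}$     $q_{n2}$   $\tilde q_{n2}$     ...   $q_{np}$   $\tilde q_{np}$

Variance      $\ell_1^2$   1                 $\ell_2^2$   1                 ...   $\ell_p^2$   1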

If we plot the individuals in the first $r\,(< p)$ dimensions using the coordinates $q_{i1}, \ldots, q_{ir}$
for the $i$-th individual, then the Euclidean distance between the individuals $i$ and $j$ in such
a plot will be an approximation to the Euclidean distance in the full $p$-space

$$d_{ij} = [(x_i - x_j)'(x_i - x_j)]^{1/2}.$$

On the other hand, if we plot the individuals in the first $r\,(< p)$ dimensions using the
coordinates $\tilde q_{i1}, \ldots, \tilde q_{ir}$, then the Euclidean distance between individuals $i$ and $j$ in such a
plot will be an approximation to the Mahalanobis distance in the $p$-space

$$d_{ij} = [(x_i - x_j)'S^{-1}(x_i - x_j)]^{1/2}.$$
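The relation between the two coordinate systems and the two distances can be checked numerically. In the sketch below (simulated data; the helper `pairwise` and all variable names are assumptions), the PC coordinates reproduce Euclidean distances and the standardized coordinates reproduce Mahalanobis distances exactly when all $p$ dimensions are kept, and only approximately when $r < p$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 5, 40
X = rng.standard_normal((p, n)) * np.arange(1, p + 1)[:, None]
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / (n - 1)
l2, C = np.linalg.eigh(S)
l2, C = l2[::-1], C[:, ::-1]

Q = (C.T @ Xc).T                 # PC coordinates q_ij, one row per individual
Q_std = Q / np.sqrt(l2)          # SPC coordinates q_ij / l_j

def pairwise(Y):
    """Euclidean distance matrix between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D_euclid = pairwise(X.T)
Sinv = np.linalg.inv(S)
diff = X.T[:, None, :] - X.T[None, :, :]
D_mahal = np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, Sinv, diff))

r = 2                            # dimension of the reduced plot
err_pc = np.abs(pairwise(Q[:, :r]) - D_euclid).max()
err_spc = np.abs(pairwise(Q_std[:, :r]) - D_mahal).max()
```

The quantities `err_pc` and `err_spc` measure how much of each distance is lost in a two dimensional plot; distances in the reduced plot never exceed those in the full space.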

In  practice,  one  may  have  to  choose  the  appropriate  distance  we  want  to  preserve  in  the 
reduced  space.  Usually,  two  or  three  dimensional  plots  may  suffice  to  capture  the  original 
configuration.  If  more  than  three  dimensions  are  necessary,  other  graphical  displays  for 
visualizing  higher  dimensional  plots  may  be  used.  See  for  instance  the  paper  by  Wegman, 
Carr and Luo (1993).

We can also represent the variables in a lower dimensional space to provide a visual
examination of the associations between them. The full set of coordinates for this purpose
is given in Table 3.


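Table 3

variables   coordinates
            dim 1              dim 2              ...   dim p

1           $\tilde c_{11}$    $\tilde c_{21}$    ...   $\tilde c_{p1}$
2           $\tilde c_{12}$    $\tilde c_{22}$    ...   $\tilde c_{p2}$
...
p           $\tilde c_{1p}$    $\tilde c_{2p}$    ...   $\tilde c_{pp}$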

Let us denote the vector connecting the point representing the $i$-th variable in the
$r$-dimensional space to the origin by $v_i$. Then $v_i'v_i$ is a good approximation to the
variance of the $i$-th variable, and the cosine of the angle between the vectors $v_i$ and $v_j$ will
be a good approximation of the correlation between the $i$-th and $j$-th variables.

2.5  Analysis  of  residuals  and  detection  of  outliers 

If we retain the first $r$ PC's, we can compute the error in the approximation $\hat x_i$ to $x_i$, the
$p$-vector of measurements on the $i$-th individual, by

$$x_i - \hat x_i = (c_{r+1}c_{r+1}' + \cdots + c_pc_p')x_i$$

and an overall measure of difference is

$$d_i^2 = (x_i - \hat x_i)'(x_i - \hat x_i) = q_{i,r+1}^2 + \cdots + q_{ip}^2.$$

If some $d_i^2$ is large compared to the others, we have an indication that $x_i$ may be an outlier.
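A residual check of this kind can be sketched as follows. The example is illustrative: the function name `residual_d2`, the simulated one-factor data and the planted outlier are assumptions, and the observations are centred at the sample mean before the PC's are computed.

```python
import numpy as np

def residual_d2(X, r):
    """Squared residual d_i^2 of each individual after keeping r PCs.
    X is p x n; the columns are centred internally."""
    Xc = X - X.mean(axis=1, keepdims=True)
    S = Xc @ Xc.T / (X.shape[1] - 1)
    _, C = np.linalg.eigh(S)
    C = C[:, ::-1]                       # eigenvectors, decreasing eigenvalue
    Q = C.T @ Xc                         # p x n matrix of PC scores
    return (Q[r:] ** 2).sum(axis=0)      # q_{i,r+1}^2 + ... + q_{ip}^2

rng = np.random.default_rng(3)
n = 60
f = rng.standard_normal((1, n))
# Three variables driven by one common factor plus small noise.
X = np.array([[1.0], [1.0], [1.0]]) @ f + 0.05 * rng.standard_normal((3, n))
X[:, 10] += np.array([1.0, -1.0, 0.0])   # planted outlier, off the main axis
d2 = residual_d2(X, r=1)
suspect = int(np.argmax(d2))             # index of the largest residual
```

The planted observation is unremarkable on the first PC but stands out in the residual, which is the situation this diagnostic is meant to catch.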

Note 1. The PC's are not invariant under linear transformations of the original variables.
For  instance,  if  the  original  variables  are  scaled  by  different  numbers  or  if  they  are  rotated 
by  a  linear  transformation,  the  PC’s  will  be  different.  This  suggests  that  an  initial  decision 
has  to  be  made  on  transforming  the  original  measurements  to  a  new  set  and  then  extracting 
the  PC’s.  The  recommendation  usually  made  is  to  scale  the  measurements  by  the  inverse 
of  the  standard  deviations,  which  is  equivalent  to  finding  the  PC’s  based  on  the  correlation 
matrix  rather  than  the  covariance  matrix. 
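The effect of scaling described in Note 1 can be seen in a small simulation. The sketch below is illustrative only; the data, the scale factors and the helper names `first_pc` and `standardize` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Hypothetical correlated data, n individuals by 3 variables.
Z = rng.standard_normal((n, 3)) @ np.array([[1.0, 0.5, 0.2],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 1.0]])

def first_pc(data):
    """Unit-norm first eigenvector of the covariance matrix of data (n x p),
    with the sign fixed so that its largest element is positive."""
    _, V = np.linalg.eigh(np.cov(data, rowvar=False))
    v = V[:, -1]
    return v if v[np.argmax(np.abs(v))] > 0 else -v

def standardize(data):
    """Scale each variable by the inverse of its standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

scale = np.array([1.0, 100.0, 0.01])     # e.g. a change of measurement units
pc_raw = first_pc(Z)
pc_scaled = first_pc(Z * scale)          # dominated by the rescaled variable

pc_corr_raw = first_pc(standardize(Z))
pc_corr_scaled = first_pc(standardize(Z * scale))
```

The covariance-based first PC changes with the units, while the PC's of the standardized variables, which are the PC's of the correlation matrix, do not.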


Note 2. There are tests available on the eigen values and eigen vectors of a covariance
matrix when the original measurements have a multivariate normal distribution [see
Chapter 4 of Basilevsky (1994)]. In practice, it may be necessary to test for normality of
the original measurements if these tests are to be applied. It may be useful to try
transformations of the measurements by using the Box-Cox family of transformations to induce
normality if necessary. Several computer programs allow for this option. In such a case, we
will be computing the PC's of transformed variables.

Note  3.  In  some  problems  such  as  the  analysis  of  growth  curves,  the  PC’s  are  computed 
from  the  matrix  S  =  XX'  without  making  correction  for  the  mean.  The  references  to  such 
methods  are  Rao  (1958,  1987). 

Note  4.  It  has  been  suggested  by  Jolicoeur  and  Mosimann  (1960)  that  the  first  principal 
component,  which  has  the  maximum  variance,  may  be  interpreted  as  a  size  factor  provided 
all  the  coefficients  are  positive,  and  other  principal  components  with  positive  and  negative 
coefficients  as  shape  factors.  A  justification  for  such  an  interpretation  may  be  given  as 
follows. Consider the $i$-th variable $x_i$ in $x$ and the $j$-th PC, $c_j'x$, of $x$. The regression of
$x_i$ on $c_j'x$ is $c_{ji}$, the $i$-th element in the $j$-th eigen vector $c_j$. Now a unit increase in $c_j'x$
produces on the average an increase $c_{ji}$ in $x_i$. If all the elements in $c_j$ are positive, a unit
increase in $c_j'x$ increases the value of each of the measurements, in which case $c_j'x$ may be
described as a size factor. If some coefficients are positive and others are negative, then an
increase in $c_j'x$ increases the values of some measurements and decreases the values of the
others, in which case $c_j'x$ may be interpreted as a shape factor.

It  may  be  of  interest  to  note  that  if  all  the  original  measurements  are  non-negative, 
then  the  first  PC  of  the  uncorrected  sum  of  squares  and  products  matrix  will  have  all  its 
coefficients  non-negative. 

Note  5.  Another  particular  case  of  the  general  problem  stated  in  Section  2.1  is  when  x 
and  y  are  completely  different  sets  of  variables.  Such  a  situation  arises  when  we  have  a  large 
number of what are called instrumental variables represented by $y$, and we wish to predict
each dependent variable in the set $x$ using certain linear functions of $y$. Such a procedure may
be more economical and sometimes more efficient due to multicollinearity in $y$.

2.6  Principal  components  of  x  uncorrelated  with  concomitant  variables  z 

In some problems it is of interest to find the principal components of a $p$-vector $x$
uncorrelated with a $q$-vector of concomitant variables $z$. Let

$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \tag{2.13}$$

denote the covariance matrix of $(x', z')'$ in the partitioned form. We need $k$ principal
components $L_1'x, \ldots, L_k'x$ such that $L_i'L_i = 1$, $L_i'L_j = 0$ for $i \ne j$, and
$\operatorname{cov}(L_i'x, z) = L_i'\Sigma_{12} = 0$, $i, j = 1, \ldots, k$, and

$$L_1'\Sigma_{11}L_1 + \cdots + L_k'\Sigma_{11}L_k \tag{2.14}$$

is a maximum. It is shown in Rao (1964) that the optimum choice of $L_1, \ldots, L_k$ is the
first $k$ right eigen vectors of the matrix

$$(I - \Sigma_{12}(\Sigma_{21}\Sigma_{12})^{-1}\Sigma_{21})\Sigma_{11}. \tag{2.15}$$

As an application, let us consider a $p$-vector time series representing some blocks of
economic transactions considered by Stone (1947).

Economic          Time periods
transactions      1           2           ...   T

1                 $x_{11}$    $x_{12}$    ...   $x_{1T}$
...
p                 $x_{p1}$    $x_{p2}$    ...   $x_{pT}$

Concomitants (functions of time)
linear            1           2           ...   $T$
quadratic         1           $2^2$       ...   $T^2$

We compute the $(p + 2)$ order covariance matrix arising out of the main variables and
concomitants, considering $T$ as the sample size,

$$\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \tag{2.16}$$

where $S_{11}$ is of order $p \times p$, $S_{12}$ of order $p \times 2$ and $S_{22}$ of order $2 \times 2$.
The necessary number of right eigen vectors of

$$(I - S_{12}(S_{21}S_{12})^{-1}S_{21})S_{11} \tag{2.17}$$

provide principal components of $x$ unaffected by linear and quadratic trends of the
transactions over time. Elimination of lower order or higher order trends is possible by suitably
choosing the concomitant variables as powers of time.
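The trend-free components based on (2.17) can be sketched on a simulated series. Everything in the example is an assumption made for illustration: the function name `pcs_uncorrelated_with`, the simulated transactions, and the use of rescaled time $t/T$ as concomitant (rescaling does not change the projector). Since $P = I - S_{12}(S_{21}S_{12})^{-1}S_{21}$ is a symmetric projector, the right eigen vectors of $PS_{11}$ with non-zero eigen values coincide with eigen vectors of the symmetric matrix $PS_{11}P$, which is what the sketch decomposes.

```python
import numpy as np

def pcs_uncorrelated_with(X, Zc, k):
    """X: p x T main variables, Zc: q x T concomitants.
    Returns (L, S11, S12); the columns of L are L_1, ..., L_k."""
    p = X.shape[0]
    S = np.cov(np.vstack([X, Zc]))            # (p + q) x (p + q)
    S11, S12 = S[:p, :p], S[:p, p:]
    P = np.eye(p) - S12 @ np.linalg.inv(S12.T @ S12) @ S12.T
    # Eigenvectors of P S11 P with positive eigenvalue lie in the range of P,
    # so they are right eigenvectors of P S11 and satisfy L' S12 = 0.
    _, V = np.linalg.eigh(P @ S11 @ P)
    return V[:, ::-1][:, :k], S11, S12

T, p = 80, 5
tau = np.arange(1, T + 1, dtype=float) / T    # rescaled time
rng = np.random.default_rng(5)
X = (np.outer(rng.standard_normal(p), tau)          # linear trend
     + np.outer(rng.standard_normal(p), tau ** 2)   # quadratic trend
     + rng.standard_normal((p, T)))                 # remaining variation
Zc = np.vstack([tau, tau ** 2])
L, S11, S12 = pcs_uncorrelated_with(X, Zc, k=2)
```

The resulting linear functions `L.T @ X` have zero sample covariance with both trend terms, and at most $p - 2$ such components exist in this setting.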

Stone  (1947)  considered  the  above  problem  of  isolating  linear  functions  of  x  which  have  an 
intrinsic  economic  significance  from  those  which  represent  trend  with  time  and  those  which 
measure  random  errors.  For  this  purpose  he  computed  the  covariance  matrix  of  x  variables 
alone and found the PC's using the eigen vectors of the $S_{11}$ part of the matrix without
any reference to the time factor. The problem was then posed as that of identifying the
dominant PC which accounted for a large variance. This was interpreted as linear trend, and
the other PC's were interpreted in economic terms. It is believed that the method suggested
here, of obtaining the PC's using the matrix (2.17), is more flexible and provides a better
technique for eliminating trend of any order and for obtaining linear functions with intrinsic
economic significance.


3.  Model  Based  Principal  Components 

3.1  An  analogy  with  the  factor  analytic  model 

Let us suppose that the measurement $p$-vector $x_i$ on individual $i$ can be expressed as

$$x_i = \alpha + Af_i + \epsilon_i, \quad i = 1, \ldots, n \tag{3.1}$$

where $\alpha$ is a $p$-vector and $A$ is a $p \times r$ matrix common to all individuals, $f_i$ is an $r$-vector
specific to individual $i$, and $\epsilon_i$ is a random variable such that $E(\epsilon_i) = 0$ and $V(\epsilon_i) = \sigma^2 I$
for $i = 1, \ldots, n$. The model (3.1) is analogous to the FA model except that in FA the
covariance matrix of $\epsilon_i$ is diagonal with possibly different elements (see Section 4 of the
paper). The problem we consider is one of estimating $A$, $f_1, \ldots, f_n$ and $\sigma^2$ from the model
(3.1). Note that the solution is not unique unless we impose certain restrictions such as that
the columns of $A$ are orthonormal. We can write the joint model (3.1) as

$$X = \alpha e' + AF + E \tag{3.2}$$

where $X = (x_1 : \cdots : x_n)$ is a $p \times n$ matrix, $e$ is an $n$-vector of unities, and $F$ is an $r \times n$ matrix.
We may estimate $\alpha$, $A$ and $F$ by minimizing

$$\|X - \alpha e' - AF\| \tag{3.3}$$

for an appropriately chosen norm. The choice of Frobenius norm leads to an extended
method of least squares where the expression

$$\sum_{i=1}^{n} (x_i - \alpha - Af_i)'(x_i - \alpha - Af_i) \tag{3.4}$$

is minimized with respect to $\alpha$, $A$ and $f_1, \ldots, f_n$. One possible solution (see Rao (1995)) is

$$\hat\alpha = \bar x, \quad \hat A = (c_1 : \cdots : c_r), \quad \hat f_i = \hat A'(x_i - \bar x) \tag{3.5}$$

where $c_1, \ldots, c_r$ are the first $r$ eigen vectors of $S = (n-1)^{-1}X(I - \frac{1}{n}ee')X'$. Then $\hat f_i$ is the vector
of $r$ PC's for the individual $i$. We thus have the same solution as that discussed in Sections
2.2 - 2.5. An estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{n-1}{(n-r-1)(p-r)}\,(\ell_{r+1}^2 + \cdots + \ell_p^2) \tag{3.6}$$

where $\ell_{r+1}^2, \ldots, \ell_p^2$ are the last $(p - r)$ eigen values of $S$.
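The estimates (3.5) and (3.6) can be sketched as follows. The function name `fit_pc_model` and the simulated data (true $\sigma^2 = 0.25$) are assumptions made for illustration; the code simply evaluates the formulas above with $S$ computed using the divisor $n - 1$.

```python
import numpy as np

def fit_pc_model(X, r):
    """Least squares fit of x_i = alpha + A f_i + e_i with orthonormal A.
    X is p x n. Returns alpha_hat, A_hat (p x r), F_hat (r x n), sigma2_hat."""
    p, n = X.shape
    alpha = X.mean(axis=1, keepdims=True)
    Xc = X - alpha
    S = Xc @ Xc.T / (n - 1)
    l2, C = np.linalg.eigh(S)
    l2, C = l2[::-1], C[:, ::-1]                   # decreasing eigenvalues
    A = C[:, :r]
    F = A.T @ Xc                                   # scores as in (3.5)
    sigma2 = (n - 1) / ((n - r - 1) * (p - r)) * l2[r:].sum()   # (3.6)
    return alpha, A, F, sigma2

rng = np.random.default_rng(8)
p, r, n = 6, 2, 500
A_true = rng.standard_normal((p, r)) * 3.0         # hypothetical loadings
X = (1.0 + A_true @ rng.standard_normal((r, n))
     + 0.5 * rng.standard_normal((p, n)))          # noise variance 0.25
alpha_hat, A_hat, F_hat, sigma2_hat = fit_pc_model(X, r)
```

The residual sum of squares of the fitted model equals $(n-1)(\ell_{r+1}^2 + \cdots + \ell_p^2)$, so (3.6) is that sum divided by $(n - r - 1)(p - r)$.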

In some problems, it may be appropriate to consider $f_i$ in the model (3.1) as a random
variable with the identity $I$ as covariance matrix. In such a case

$$E(S) = AA' + \sigma^2 I, \tag{3.7}$$

an estimate of $A$ is

$$\hat A = (\ell_1c_1 : \cdots : \ell_rc_r), \tag{3.8}$$

and an estimate of $\sigma^2$ is

$$\hat\sigma^2 = (p - r)^{-1}(\ell_{r+1}^2 + \cdots + \ell_p^2) \tag{3.9}$$

which are the same as in (3.6) except for scaling factors. If it is desired to estimate (predict)
$f_i$, one may use the regression of $f_i$ on $x_i$, which is of the form

$$\hat f_i = \hat A'(\hat A\hat A' + \hat\sigma^2 I)^{-1}(x_i - \bar x) \tag{3.10}$$

and  differs  from  the  expression  (3.5).  A  similar  situation  arises  when  we  want  to  estimate 
the  parameters  simultaneously  from  several  linear  models  having  the  same  design  matrix. 
Reference  may  be  made  to  Rao  (1975)  for  a  discussion  of  such  a  problem. 


3.2  Regression  problem  based  on  a  PC  model 

We have $n$ independent observations on a $(p + 1)$-vector random variable $(y, x)$, where $x$
is a $p$-vector and $y$ is a scalar,

$$(y_1, x_1), \ldots, (y_n, x_n) \tag{3.11}$$

and only $x_{n+1}$ for the $(n + 1)$-th sample. The problem is to predict $y_{n+1}$, the unobserved
value, under the PC model

$$x_i = \alpha_1 + Af_i + \epsilon_i \tag{3.12}$$

$$y_i = \alpha_2 + b'f_i + \eta_i \tag{3.13}$$

$$i = 1, \ldots, n + 1$$

where $\operatorname{cov}(\epsilon_i, \eta_i) = 0$, $\operatorname{cov}(\epsilon_i) = \sigma^2 I$, $V(\eta_i) = \sigma_0^2$, and the rest of the assumptions are the
same as in the model (3.1). The above problem was considered in a series of papers (see Rao
(1975, 1976, 1978, 1987) and Rao and Boudreau (1985)). Recently, the model (3.12)-(3.13)
has been used in the development of partial least squares (see Helland (1988) and the
references therein).

There  are  several  possible  approaches  to  the  problem. 

1) Let $\hat f_1, \ldots, \hat f_{n+1}$ be the estimates of $f_1, \ldots, f_{n+1}$ using the observational equations (3.12)
only. Then find estimates $\hat\alpha_2$ and $\hat b$ of $\alpha_2$ and $b$, using the first $n$ observational equations of
(3.13) and assuming $\hat f_1, \ldots, \hat f_n$ as known, by the usual least squares method. Finally predict
$y_{n+1}$ by the formula

$$\hat y_{n+1} = \hat\alpha_2 + \hat b'\hat f_{n+1}. \tag{3.14}$$

2) Let $\hat\alpha_1, \hat\alpha_2, \hat A$ and $\hat b$ be the estimates of $\alpha_1, \alpha_2, A$ and $b$ using the first $n$ observational
equations in (3.12) and (3.13). Then estimate $f_{n+1}$ using the equations

$$x_{n+1} = \hat\alpha_1 + \hat Af_{n+1} + \epsilon_{n+1} \tag{3.15}$$

assuming $\hat\alpha_1$ and $\hat A$ as known, by the least squares method. If $\hat f_{n+1}$ is the estimate of
$f_{n+1}$, then $y_{n+1}$ is predicted by

$$\hat y_{n+1} = \hat\alpha_2 + \hat b'\hat f_{n+1}. \tag{3.16}$$

3) Substitute a value, say $y$, for $y_{n+1}$ to make the equations (3.12)-(3.13) complete. Then
find the singular value decomposition of the partitioned matrix

$$\begin{pmatrix} x_1 & \cdots & x_n & x_{n+1} \\ y_1 & \cdots & y_n & y \end{pmatrix}\left(I - (n+1)^{-1}ee'\right) = \ell_1c_1q_1' + \cdots + \ell_{p+1}c_{p+1}q_{p+1}'$$

where $\ell_i$, $c_i$ and $q_i$ depend on $y$, and compute

$$S_r(y) = \ell_{r+1}^2(y) + \cdots + \ell_{p+1}^2(y). \tag{3.17}$$

Finally predict $y_{n+1}$ as the value of $y$ which minimizes (3.17). The solution may be obtained
graphically or by an iterative algorithm as described in Rao and Boudreau (1985).

4) Another method is to consider $f_i$ as a random variable with zero mean vector and
covariance matrix $\Gamma$. Then

$$\operatorname{cov}\begin{pmatrix} x_i \\ y_i \end{pmatrix} = \begin{pmatrix} A\Gamma A' + \sigma^2 I & A\Gamma b \\ b'\Gamma A' & b'\Gamma b + \sigma_0^2 \end{pmatrix}. \tag{3.18}$$

Using (3.12) and the first $n$ observational equations in (3.13), obtain the estimates of
$A$, $\Gamma$, $b$, $\sigma^2$ and $\sigma_0^2$. Methods described by Bentler (1983), Sorbom (1974) and Rao (1983,
1985) may be used for this purpose. Then $y_{n+1}$ may be predicted by

$$\hat y_{n+1} = \bar y + b'\Gamma A'(A\Gamma A' + \sigma^2 I)^{-1}(x_{n+1} - \bar x) \tag{3.19}$$

where $\bar y = n^{-1}\sum y_i$, $\bar x = (n+1)^{-1}\sum x_i$ and for $b$, $\Gamma$, $A$ and $\sigma^2$ their estimates are substituted.
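Two of the approaches above can be sketched on simulated data. The first function is a simplified two-stage variant in the spirit of approaches 1) and 2): the PC's are taken from $x_1, \ldots, x_n$ only, $y$ is regressed on them, and the fit is applied to $x_{n+1}$. It is not the joint estimation described in 2). The second function carries out approach 3) by a plain grid search rather than the iterative algorithm of Rao and Boudreau (1985). All function names, the grid and the simulated model are assumptions made for illustration.

```python
import numpy as np

def pcr_predict(X, y, x_new, r):
    """Two-stage sketch: PCs from x_1..x_n, least squares of y on the PCs,
    then prediction at x_new. X is p x n, y has length n."""
    xbar = X.mean(axis=1, keepdims=True)
    Xc = X - xbar
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :r]                                  # orthonormal A_hat
    F = A.T @ Xc                                  # r x n scores f_hat_i
    D = np.column_stack([np.ones(len(y)), F.T])   # design matrix [1, f_hat_i']
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)  # (alpha2_hat, b_hat)
    f_new = A.T @ (x_new - xbar[:, 0])
    return coef[0] + coef[1:] @ f_new

def svd_predict(X, y, x_new, r, grid):
    """Approach 3: pick the candidate value minimizing S_r(y) of (3.17)."""
    top = np.column_stack([X, x_new])             # p x (n + 1)
    best_val, best_y = np.inf, None
    for cand in grid:
        M = np.vstack([top, np.append(y, cand)])  # (p + 1) x (n + 1)
        M = M - M.mean(axis=1, keepdims=True)     # same as M (I - ee'/(n+1))
        sv = np.linalg.svd(M, compute_uv=False)
        s_r = (sv[r:] ** 2).sum()
        if s_r < best_val:
            best_val, best_y = s_r, cand
    return best_y

rng = np.random.default_rng(6)
p, r, n = 5, 2, 100
A_true = np.array([[1, 0], [0, 1], [1, 1], [1, -1], [0.5, 0.5]], dtype=float)
b_true = np.array([1.0, -2.0])
F_true = rng.standard_normal((r, n + 1))
X_all = A_true @ F_true + 0.05 * rng.standard_normal((p, n + 1))
y_all = 3.0 + b_true @ F_true + 0.05 * rng.standard_normal(n + 1)

X, y, x_new, y_true = X_all[:, :n], y_all[:n], X_all[:, n], y_all[n]
pred_pcr = pcr_predict(X, y, x_new, r)
grid = np.linspace(y.min() - 3.0, y.max() + 3.0, 801)
pred_svd = svd_predict(X, y, x_new, r, grid)
```

When the noise is small relative to the factor structure, as in this simulation, both predictions should lie close to the held-out value `y_true`; the grid search trades accuracy for simplicity, since its resolution is limited by the grid spacing.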


4.  Factor  Analysis 


4.1  General  discussion 

In FA, a $p$-vector variable $x$ is endowed with a stochastic structure

$$x = \alpha + Af + \epsilon \tag{4.1}$$

where $\alpha$ is a $p$-vector and $A$ is a $p \times r$ matrix of parameters, $f$ is an $r$-vector of latent
variables called common factors and $\epsilon$ is a $p$-vector of variables called specific factors, with the
following assumptions:

$$E(\epsilon) = 0, \quad \operatorname{cov}(\epsilon) = \Delta, \text{ a diagonal matrix}, \quad E(f) = 0, \quad \operatorname{cov}(f, \epsilon) = 0, \quad \operatorname{cov}(f) = I. \tag{4.2}$$

As a consequence of (4.2), we have

$$\Sigma = \operatorname{cov}(x) = AA' + \Delta. \tag{4.3}$$

Note that (4.3) reduces to the PC model considered in (3.1) when $\Delta = \sigma^2 I$. The problems
generally discussed in FA, on the basis of $n$ independent observations $x_1, \ldots, x_n$ made on
$x$, are:

1)  What  is  the  minimum  r  for  which  the  representation  (4.3)  holds? 

2)  How  do  we  estimate  A,  called  the  matrix  of  factor  loadings? 

3)  How  do  we  interpret  the  factors? 

4)  How  do  we  estimate  f  for  a  given  individual  given  the  observable  x? 

It  may  be  noted  that  the  equation  (4.3)  does  not  ensure  the  existence  of  a  unique  A  even 
for  a  given  r  and  so  also  f  in  (4.1).  However,  the  object  is  to  obtain  any  particular  solution, 
and  consider  transformations  of  A  and  f  for  an  interpretation.  References  to  a  discussion 
of  non-identifiability  of  A  and  f  and  rotation  of  factors  are  Basilevsky  (1994,  pp. 355-360, 
402-404),  Jackson  (1991,  pp.393-396),  Jolliffe  (1986,  pp.117-118). 
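The non-identifiability of $A$ can be illustrated numerically: for any orthogonal matrix $T$, the loadings $AT$ give the same covariance structure (4.3) as $A$. The loadings, specific variances and rotation angle below are hypothetical values chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
p, r = 5, 2
A = rng.standard_normal((p, r))                    # hypothetical factor loadings
Delta = np.diag(rng.uniform(0.5, 1.5, size=p))     # hypothetical specific variances

theta = 0.7                                        # any rotation angle
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # orthogonal: T T' = I
A_rot = A @ T                                      # rotated loadings

Sigma = A @ A.T + Delta
Sigma_rot = A_rot @ A_rot.T + Delta                # same covariance matrix
```

Since the data determine only $\Sigma$, the choice among such rotations has to be made on grounds of interpretation.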
Denoting $X = (x_1, \ldots, x_n)$, we compute

$$\bar x = n^{-1}Xe, \quad S = (n-1)^{-1}X\left(I - \frac{1}{n}ee'\right)X' \tag{4.4}$$

as estimates of $\alpha$ and $\Sigma$. Then estimate $A$ and $\Delta$ starting with $S$. The most commonly
used method is maximum likelihood (ML) under the assumption of multivariate normality
of the vector variable $x$. There are a number of computer packages for the estimation of
$r$, the number of factors, $A$, the matrix of factor loadings, and $\Delta$, the matrix of specific
factor variances. (See for instance SPSS, SAS, OSIRIS, BMD, COFAMM etc., which also
offer alternatives other than ML estimates and also compute rotations of factor loadings for
interpretation.) Let us denote the ML estimates of $A$ and $\Delta$ by $\hat A$ and $\hat\Delta$.

The likelihood ratio test criterion for testing the hypothesis that there are $r$ common factors
is

$$-(n-1)\log\frac{|S|}{|\hat A\hat A' + \hat\Delta|} \tag{4.5}$$

which is asymptotically distributed as $\chi^2$ on $[(p - r)^2 - p - r]/2$ degrees of freedom.
This is valid under the assumption of multivariate normality. A slight improvement
to the approximation is obtained by replacing the multiplier $(n - 1)$ in (4.5) by

$$n - 1 - \frac{2p + 5}{6} - \frac{2r}{3}. \tag{4.6}$$
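The test criterion and its degrees of freedom can be computed as in the sketch below. The function name `lr_test_factors` is an assumption, and the loadings, specific variances and matrix $S$ are hypothetical inputs, not ML estimates from real data, so the value of the statistic here has no inferential meaning. A p-value would additionally require the $\chi^2$ distribution function, which is not computed.

```python
import numpy as np

def lr_test_factors(S, A_hat, Delta_hat, n, bartlett=True):
    """Statistic (4.5), optionally with the corrected multiplier, and the
    degrees of freedom [(p - r)^2 - p - r] / 2."""
    p, r = A_hat.shape
    Sigma_hat = A_hat @ A_hat.T + Delta_hat
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_fit = np.linalg.slogdet(Sigma_hat)
    mult = n - 1 - (2 * p + 5) / 6 - 2 * r / 3 if bartlett else n - 1
    stat = -mult * (logdet_S - logdet_fit)
    df = ((p - r) ** 2 - p - r) // 2
    return stat, df

# Hypothetical two-factor loadings for six standardized variables.
A_hat = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2],
                  [0.1, 0.8], [0.2, 0.7], [0.0, 0.9]])
Delta_hat = np.diag(1.0 - (A_hat ** 2).sum(axis=1))
# Hypothetical observed matrix: the fitted structure plus a small misfit.
S = A_hat @ A_hat.T + Delta_hat + 0.02 * (np.ones((6, 6)) - np.eye(6))

stat, df = lr_test_factors(S, A_hat, Delta_hat, n=100)
stat_plain, _ = lr_test_factors(S, A_hat, Delta_hat, n=100, bartlett=False)
```

For $p = 6$ and $r = 2$ the degrees of freedom are $(16 - 8)/2 = 4$, and the statistic vanishes when the fitted matrix reproduces $S$ exactly.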

An alternative method called canonical factor analysis (CFA) for the estimation of $A$ and
$\Delta$ was developed by Rao (1955) without making any distributional assumptions. The
solution turns out to be the same as the ML estimate. However, the $\chi^2$-test of (4.5) requires the
assumption of multivariate normality.

A general recommendation is to test for multivariate normality based on the observed data
$x_1, \ldots, x_n$ using some of the techniques available in computer packages. Some references
to a discussion of tests of normality are Basilevsky (1994, Section 4.6.2) and Gnanadesikan
(1977, Section 5.4.2). It may also be worthwhile making transformations of variables to
achieve normality. But in such a case the factor structure has to be imposed on transformed
variables.

It  may  be  noted  that  unlike  PCA,  the  FA  is  invariant  under  scaling  of  variables,  if  one 
uses  scale  free  extraction  methods  such  as  the  ML  and  CFA.  In  these  cases,  one  can  use 
the  covariance  or  the  correlation  matrix  to  start  with.  If  the  covariance  matrix  is  used  and 
scales  vary  very  widely,  scale  factors  will  complicate  interpretation  of  results.  In  such  a  case, 
there  is  some  advantage  in  using  the  correlation  matrix.  The  covariance  matrix  is  preferable 
when comparison of factor structures between groups is involved (see Sorbom (1974)).

4.2 Estimation of factor scores

Using the estimates $\hat A$ and $\hat\Delta$ of $A$ and $\Delta$ in the representation of $\Sigma$, we can estimate the
factor score $f_i$ of the $i$-th individual with measurements $x_i$ by

$$\hat f_i = \hat A'(\hat A\hat A' + \hat\Delta)^{-1}(x_i - \bar x), \quad i = 1, \ldots, n. \tag{4.7}$$

The expression (4.7) is simply the regression of $f_i$ on $x_i$ with the estimates substituted for
the unknowns. There are other expressions suggested for the estimates of factor scores (see
Jackson (1991, p.409)).
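A direct evaluation of (4.7) can be sketched as follows. The function name `regression_factor_scores`, the loadings, the specific variances and the simulated data are hypothetical. The test accompanying the sketch relies on the standard matrix identity $A'(AA' + \Delta)^{-1} = (I + A'\Delta^{-1}A)^{-1}A'\Delta^{-1}$, which requires inverting only an $r \times r$ matrix.

```python
import numpy as np

def regression_factor_scores(X, A_hat, Delta_hat):
    """X is p x n; returns the r x n matrix of factor scores from (4.7)."""
    xbar = X.mean(axis=1, keepdims=True)
    Sigma_hat = A_hat @ A_hat.T + Delta_hat
    # Solve Sigma_hat W = (X - xbar) instead of forming the inverse explicitly.
    return A_hat.T @ np.linalg.solve(Sigma_hat, X - xbar)

rng = np.random.default_rng(10)
p, r, n = 6, 2, 30
A_hat = rng.standard_normal((p, r))                    # hypothetical loadings
Delta_hat = np.diag(rng.uniform(0.2, 1.0, size=p))     # hypothetical variances
X = (A_hat @ rng.standard_normal((r, n))
     + np.sqrt(np.diag(Delta_hat))[:, None] * rng.standard_normal((p, n)))

F_hat = regression_factor_scores(X, A_hat, Delta_hat)
```

Each column of `F_hat` is the estimated score vector of one individual.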


4.3  Prediction  problem 


We consider a $(p + 1)$-vector variable $(x, y)$ with the factor structure

$$x = \alpha + Af + \epsilon, \quad y = \beta + a'f + \eta \tag{4.8}$$

where $\beta$ is a scalar, $a$ is an $r$-vector and $\eta$ is such that $E(\eta) = 0$, $\operatorname{cov}(\epsilon, \eta) = 0$, $V(\eta) = \delta_{p+1}$.
Suppose that we have observations $(x_1, y_1), \ldots, (x_n, y_n)$ on $n$ individuals and only $x_{n+1}$ on
an $(n + 1)$-th individual. The problem is to predict $y_{n+1}$, given all the other observations.
By considering the factor structure of the $(p + 1)$-vector variable

$$\begin{pmatrix} x \\ y \end{pmatrix} \tag{4.9}$$

and using the observations $(x_1, y_1), \ldots, (x_n, y_n)$ we estimate all the unknown parameters.
Let $\hat\alpha$, $\hat\beta$, $\hat A$, $\hat a$, $\hat\Delta$ and $\hat\delta_{p+1}$ be estimates of the corresponding parameters using the CFA or
ML method. Then the regression estimate of $y_{n+1}$ given $x_{n+1}$ is

$$\hat y_{n+1} = \hat\beta + \hat a'\hat A'(\hat A\hat A' + \hat\Delta)^{-1}(x_{n+1} - \hat\alpha). \tag{4.10}$$

In this case, we are not utilizing the information provided by $x_{n+1}$ on the parameters $\alpha$, $A$
and $\Delta$.


4.4 What is the difference between PCA and FA?

In PCA, we do not impose any structure on the $p$-vector random variable $x$. Suppose that
$E(x) = 0$ and $\operatorname{cov}(x) = \Sigma$. We wish to replace $x$ by a smaller number of linear combinations
$y = L'x$ where $L$ is a $p \times r$ matrix of rank $r$. Then the predicted value of $x$ given $y$ (i.e., the
regression of $x$ on $y$) is

$$\hat x = \Sigma L(L'\Sigma L)^{-1}y \tag{4.11}$$

and the covariance matrix of the residual $x - \hat x$ is

$$\Sigma - \Sigma L(L'\Sigma L)^{-1}L'\Sigma. \tag{4.12}$$

We wish to choose $L$ to minimize a suitable norm of (4.12). The choice of Frobenius norm
leads to the solution

$$L = (c_1 : \cdots : c_r) \tag{4.13}$$

where $c_1, \ldots, c_r$ are the first $r$ eigen vectors of $\Sigma$, in which case $L'x$ represents the first
$r$ principal components as explained in Section 2. The aim is to account for the entire
covariance matrix of $x$, to the extent possible, in terms of a reduced number of variables.

In FA, we are fitting an expression of the type $AA' + \Delta$ to $R$, the correlation matrix
of the $p$-vector variable $x$. Since $\Delta$ is a diagonal matrix of free parameters, the matrix
$A$ is virtually determined by minimizing the differences between the off diagonal elements
of $AA'$ and $R$. Thus, the matrix of factor loadings is designed to explain the correlations
between the observed variables. The variances in the variables unexplained by the factors,
irrespective of their magnitudes, are characterized as specific variances. In PCA, the emphasis
is more on explaining the overall variances arising out of common and specific factors. Thus,
the objectives of PCA and FA are different and so are the solutions.

Note 1. Fitting an expression of the type $AA' + \Delta$ to $R$ imposes an automatic upper
bound  to  r,  the  number  of  factors.  So,  in  a  given  situation,  one  is  forced  to  interpret  the 
data  in  terms  of  far  fewer  factors  than  those  that  may  have  influenced  the  data.  In  the  CFA 
developed  by  the  author  (Rao  (1955)),  no  limit  is  placed  on  the  number  of  common  factors, 
but  the  method  allows  for  the  requisite  number  of  dominant  factors  to  be  extracted  from 
the  data.  No  fixed  number  of  factors  is  postulated  to  begin  with,  and  the  problem  is  treated 
as  one  of  estimation  rather  than  testing  of  hypothesis  on  the  number  of  factors. 

Note  2.  It  may  be  of  interest  to  note  that  in  the  formulation  of  the  FA  model,  only  the 
second  order  properties  of  the  common  and  specific  factors  are  used.  However,  if  we  demand 
independence  of  the  distribution  of  all  these  variables,  the  problem  becomes  more  complex 
as  the  following  theorem  proved  in  Rao  (1969,  1973)  shows. 

Theorem. Let x be a vector random variable with a linear structure $x = Ay$, where y
is a vector of independent r.v.'s. Then x admits the decomposition

$$x = x_1 + x_2$$

where $x_1$ and $x_2$ are independent, $x_1$ has essentially a unique structure ($x_1 = A_1 y_1$ with a
unique $A_1$ apart from scaling and $y_1$ as a vector of a fixed number of independent non-normal
variables) and $x_2$ has a p-variate normal distribution with a non-unique linear structure
($x_2 = B_2 y_2$ with $B_2$ not necessarily unique and $y_2$ as a vector of independent univariate
normal variables).

In view of this theorem, if some of the factors have a non-normal distribution, the uniqueness
of $A_1$ automatically specifies a lower bound to the number of factor variables, which
may have no relationship with p. The limitations placed on the FA model by considering
only second order properties of the variables involved need some investigation.


4.5  The  arbitrage  pricing  theory  model  (APT) 


The  classical  FA  model  is  extended  to  a  statistical  model  of  the  APT  by  Ross  (1976), 
which  is  similar  to  the  growth  curve  model  of  Rao  (1958,  equation  9,  Section  3).  Consider 
the  usual  FA  model,  using  the  notation  used  in  the  finance  literature 


$$R = \mu + Bf + u \tag{4.18}$$


where R denotes the N-vector of returns on N assets, $\mu = E(R)$, $E(f) = 0$, $E(u) = 0$,
$E(fu') = 0$, $\operatorname{cov}(f) = \Phi$ and $\operatorname{cov}(u) = \Delta$, a diagonal matrix. The matrix B of order N × k
is the matrix of factor loadings. [In the earlier sections p is used for N and r for k.] From
the assumptions made

$$\Sigma = \operatorname{cov}(R) = B \Phi B' + \Delta \tag{4.19}$$
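The step to (4.19) uses only the second order assumptions listed above. Written out for the model $R = \mu + Bf + u$, with $\Phi$ and $\Delta$ denoting the covariance matrices of f and u, the cross terms vanish because $E(fu') = 0$:

```latex
\begin{aligned}
\operatorname{cov}(R)
  &= E\bigl[(Bf + u)(Bf + u)'\bigr] \\
  &= B\,E(ff')\,B' + B\,E(fu') + E(uf')\,B' + E(uu') \\
  &= B \Phi B' + \Delta .
\end{aligned}
```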


Now, we model $\mu$ as

$$\mu = R_f e + B\lambda \tag{4.20}$$


where $R_f$ is described as the riskless return on a riskless asset. The sample we have over T
time periods is

$$(R_1, R_{f1}), \ldots, (R_T, R_{fT}) \tag{4.21}$$


where in (4.21), $R_f$ is known and varies over time and $\lambda$ is a k-vector of unknown parameters
called the factor premiums. Writing $r_t = R_t - R_{ft} e$, we can write the model for the t-th
observation as

$$r_t = B(f_t + \lambda) + u_t, \quad t = 1, \ldots, T \tag{4.22}$$


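which is exactly the model considered in Rao (1958). The marginal model for $r_t$ is

$$r_t = B\lambda + v_t, \quad t = 1, \ldots, T \tag{4.23}$$

with $\operatorname{cov}(v_t) = \Sigma$. If B and $\Sigma$ are known, the least squares estimate of $\lambda$ is

$$\hat{\lambda} = (B' \Sigma^{-1} B)^{-1} B' \Sigma^{-1} \bar{r} \tag{4.24}$$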

where $\bar{r} = T^{-1}(r_1 + \cdots + r_T)$. If B and $\Sigma$ are not known, it is suggested by Roll and
Ross (1980) and also Rao (1958) that they can be estimated by ML or an appropriate
nonparametric method considering the model (4.18) with unrestricted $\mu$ as discussed in
section 4.2 of this article and substituted in (4.23). If multivariate normality is assumed for
the distribution of f and u in the model (4.18), it is possible to write down the likelihood for
all the unknown parameters B, $\lambda$, $\Phi$ and $\Delta$ based on the observations $r_1, \ldots, r_T$ and obtain
the ML estimates for all the unknown parameters. We can then also apply likelihood ratio
tests for the specification of $\Sigma$, i.e., for the number of factors, and the structure (4.20) on
$\mu$. Such a procedure is fully worked out in Christensen (...), where the method is applied to
New York Stock Exchange data.
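For the case of known B and $\Sigma$, the estimate (4.24) is easy to compute. The following is a minimal sketch on simulated data, not the procedure applied to the stock exchange data: the dimensions, the parameter values, the normal distributions and the function name `gls_premiums` are all assumptions made for illustration, and the excess returns $r_t$ are generated directly from (4.22).

```python
import numpy as np

def gls_premiums(B, Sigma, r_bar):
    """Estimate of the factor premiums as in (4.24):
    (B' Sigma^{-1} B)^{-1} B' Sigma^{-1} r_bar."""
    Si_B = np.linalg.solve(Sigma, B)               # Sigma^{-1} B
    return np.linalg.solve(B.T @ Si_B, Si_B.T @ r_bar)

# Hypothetical set-up: N assets, k factors, T time periods.
rng = np.random.default_rng(0)
N, k, T = 8, 2, 2000
B = rng.normal(size=(N, k))                        # factor loadings
Phi = np.eye(k)                                    # cov(f)
Delta = np.diag(rng.uniform(0.5, 1.5, size=N))     # cov(u), diagonal
lam = np.array([0.5, -0.3])                        # factor premiums
Sigma = B @ Phi @ B.T + Delta                      # as in (4.19)

# Simulate r_t = B (f_t + lambda) + u_t for t = 1, ..., T.
f = rng.multivariate_normal(np.zeros(k), Phi, size=T)     # T x k
u = rng.normal(size=(T, N)) * np.sqrt(np.diag(Delta))     # T x N
r = (f + lam) @ B.T + u

r_bar = r.mean(axis=0)
lam_hat = gls_premiums(B, Sigma, r_bar)
print("true premiums:     ", lam)
print("estimated premiums:", lam_hat)
```

The linear systems are solved directly instead of forming $\Sigma^{-1}$ explicitly, which is the usual numerical practice. When B and $\Sigma$ are replaced by estimates, the same function can be applied to the substituted values.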


5.  Conclusions 

Both PCA and FA may be considered as multivariate methods for exploratory data analysis.
The aim of both analyses is to understand the structure of the data by reducing the
number of variables to a smaller set which, in some sense, can replace the original data and
which is easier to study through graphical representation and multivariate inference techniques.
Some caution is necessary, as there are many decisions to be made on the number
of reduced variables and on the criterion by which the adequacy of the reduced set of variables in
representing the whole set of original variables is judged.

Some practitioners consider PCA and FA as alternative techniques of multivariate data
analysis intended to answer the same questions. It is also claimed that each technique has
evolved into a useful data-analytic tool and has become an invaluable aid to other statistical
models  such  as  cluster  and  discriminant  analysis,  least  squares  regression,  graphical  data 
displays,  and  so  forth.  As  discussed  in  the  present  article,  the  purposes  of  reduction  of 
data  in  PCA  and  FA  are  different.  In  PCA,  the  reduced  data  is  intended  to  approximate, 
to  the  maximum  possible  extent,  the  dispersion  of  the  original  data  in  terms  of  the  entire 
covariance  matrix,  while  in  FA,  the  emphasis  is  on  explaining  the  correlations  or  association 
between  the  original  variables.  The  objectives  are  different  and  a  decision  has  to  be  made 
as  to  the  appropriateness  of  PCA  or  FA  in  a  particular  situation  and  the  purpose  of  data 
analysis.  While  the  roles  of  PCA  and  FA  in  exploratory  data  analysis  are  clear,  the  exact 
uses  of  the  estimated  PC’s  and  factors  in  inferential  data  analysis,  or  in  planning  further 
investigations  do  not  seem  to  be  satisfactorily  laid  out. 

Some  conditions  under  which  the  factor  scores  and  principal  components  are  close  to  each 
other have been given by Schneeweiss and Mathes (1995). It would be of interest to pursue
such  theoretical  investigations  and  also  examine  in  individual  data  sets  the  actual  differences 
between  principal  components  and  factor  scores. 